Perform manual review and completion of generated XML structure
See Fig. 3: SID Schema

See Fig. 2 & Fig. 6.
XML-compatible markup is stored within the given rich-text document via the API of the host XML-enabled word processor.
editorial and workflow functions



end of structuring/ 
conversion cycle 



28



Obtain pure XML via 
host's Save As XML or 
Export XML function 



Figure 1 
See SID schema in Fig.

Figure 4: Example of Baseline Elements

Figure 5: Example of Element Baseline in a document schema

Run core structure

Figure 6: Conversion/structuring process: document parsing, structure inference, and XML markup creation
Context CMG :=
outer CMG of root
element's type



CMG = schema content model group 



-recursive call- 
for each child (schema
content particle) of
current Context CMG



Context CMG := 
outer CMG of element's 
type (if complex type) 
Create BESM node;
remember schema
(recursion) context; link to
baseline element
definition



Context CMG := 
current child group 



Combine child BESM nodes 
by applying Order and Group 
Occurrence specifiers from 
parent (context) CMG 
-recursive call-



Figure 7. BESM Construction 
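
The recursive walk in Figure 7 can be pictured with a minimal sketch. It assumes a deliberately simplified schema model: the `Group`, `Element`, and `Besm` classes and all function names are hypothetical, the state machine is assembled Thompson-style with unlabeled (epsilon) links, and any maximum occurrence above 1 is treated as unbounded. Transitions are labeled with full schema paths, as the note to Figure 9 describes.

```python
from dataclasses import dataclass
from itertools import count
from typing import Optional


@dataclass
class Group:                         # schema content model group (CMG)
    order: str                       # "sequence" or "choice"
    children: list
    min_occurs: int = 1
    max_occurs: Optional[int] = 1    # None means unbounded


@dataclass
class Element:                       # content particle naming an element
    name: str
    content: Optional[Group] = None  # outer CMG of a complex type, if any
    min_occurs: int = 1
    max_occurs: Optional[int] = 1


class Besm:
    def __init__(self):
        self.edges = []              # (source, label, target); label None = epsilon
        self._ids = count()

    def new_state(self):
        return next(self._ids)

    def link(self, src, label, dst):
        self.edges.append((src, label, dst))


def apply_occurrence(besm, frag, min_occurs, max_occurs):
    """Wrap a fragment so it can be skipped and/or repeated."""
    inner_start, inner_end = frag
    start, end = besm.new_state(), besm.new_state()
    besm.link(start, None, inner_start)
    besm.link(inner_end, None, end)
    if min_occurs == 0:
        besm.link(start, None, end)              # optional
    if max_occurs is None or max_occurs > 1:
        besm.link(inner_end, None, inner_start)  # repeatable (simplified)
    return start, end


def build(besm, particle, path, baseline):
    """Return the (start, end) fragment for one content particle."""
    if isinstance(particle, Element):
        full = path + "/" + particle.name
        if full in baseline or particle.content is None:
            start, end = besm.new_state(), besm.new_state()
            besm.link(start, full, end)          # node linked to a baseline element
            frag = (start, end)
        else:
            # Context CMG := outer CMG of the element's type (recursive call)
            frag = build(besm, particle.content, full, baseline)
        return apply_occurrence(besm, frag, particle.min_occurs, particle.max_occurs)
    # Context CMG := current child group (recursive call per child)
    frags = [build(besm, child, path, baseline) for child in particle.children]
    if particle.order == "sequence":
        for (_, prev_end), (next_start, _) in zip(frags, frags[1:]):
            besm.link(prev_end, None, next_start)
        frag = (frags[0][0], frags[-1][1])
    else:                                        # choice
        start, end = besm.new_state(), besm.new_state()
        for child_start, child_end in frags:
            besm.link(start, None, child_start)
            besm.link(child_end, None, end)
        frag = (start, end)
    return apply_occurrence(besm, frag, particle.min_occurs, particle.max_occurs)


def expected_transitions(besm, state):
    """Labels reachable from `state` through epsilon links only."""
    seen, stack, labels = {state}, [state], set()
    while stack:
        current = stack.pop()
        for src, label, dst in besm.edges:
            if src != current:
                continue
            if label is None:
                if dst not in seen:
                    seen.add(dst)
                    stack.append(dst)
            else:
                labels.add(label)
    return labels


# Hypothetical schema: article = title, author+ (fullname, email?), section*
root = Element("article", Group("sequence", [
    Element("title"),
    Element("author", Group("sequence", [
        Element("fullname"), Element("email", min_occurs=0)]), max_occurs=None),
    Element("section", min_occurs=0, max_occurs=None),
]))
baseline = {"/article/title", "/article/author/fullname",
            "/article/author/email", "/article/section"}
besm = Besm()
start, end = build(besm, root.content, "/" + root.name, baseline)
print(expected_transitions(besm, start))  # {'/article/title'}
```

Wrapping each fragment in its own start and end states keeps the occurrence links of nested groups from interfering with one another, at the cost of extra epsilon states.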




Figure 8. Data-structure relations between baseline elements, the BESM, 

and the XML schema 



fullname
[of <author> element]





[start of new document section]

title [document section start]




Note: The BESM transitions shown here are simple 
element names. In reality, they are full schema paths 
(starting from the defined root element). 



Figure 9. BESM fragment samples 
TCT = tentative conversion tree

CPR = cumulative plausibility rating (180)

Determine leading
conversion step (leaf
TCT node with highest
CPR)

Fig. 11: 180

Commit TCT root




Discard all root branches 
except the one containing 
the current leading step; 
assign new TCT root 
Obtain list of expected/
allowed [baseline] element
transitions from BESM state
of leading TCT node

Instance of 182 (Fig. 11). New BESM
state is target of element transition;
new document position is just past the
matched element range;
CPR += baseline element priority value



Matched any element
AND max element
transition gain
satisfactory?



Instance of 184. Keep 
document position of parent 
node; 

new BESM state is target of 
skipped element transition; 
CPR -= skip elem penalty. 



Instance of 186. Keep 
BESM state of parent node; 
document position is 
beginning of next 
paragraph; 

CPR -= skip para penalty. 
while TCT not empty and
not reached end of
document



178 



Figure 10: Core structure inference algorithm 
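
The loop in Figure 10 can be approximated by the following minimal sketch. It rests on several assumptions not taken from the figures: the BESM is reduced to a plain dictionary, every baseline element matches exactly one whole paragraph, the penalty and threshold constants are invented, and the root commit and branch pruning step is left out (the winning path is read back from the final leading leaf instead). All names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

SKIP_ELEM_PENALTY = 3   # assumed value
SKIP_PARA_PENALTY = 5   # assumed value
MIN_GAIN = 1            # assumed threshold for a "satisfactory" gain


@dataclass(eq=False)
class Node:
    state: str                      # BESM state
    pos: int                        # next document position (paragraph index)
    cpr: int                        # cumulative plausibility rating
    kind: str = "root"              # "markup", "skip-elem" or "skip-para"
    element: Optional[str] = None
    parent: Optional["Node"] = None


def infer(paragraphs, besm, matchers, start_state):
    leaves = [Node(start_state, 0, 0)]
    while leaves:
        # Leading conversion step: leaf with the highest CPR.
        lead = max(leaves, key=lambda n: n.cpr)
        if lead.pos >= len(paragraphs):
            break                   # reached end of document
        leaves.remove(lead)
        transitions = besm.get(lead.state, [])
        gains = []
        for element, target in transitions:
            gain = matchers[element](paragraphs[lead.pos])
            if gain is not None:    # markup conversion step
                gains.append(gain)
                leaves.append(Node(target, lead.pos + 1, lead.cpr + gain,
                                   "markup", element, lead))
        if not gains or max(gains) < MIN_GAIN:
            for element, target in transitions:   # skipped-element steps
                leaves.append(Node(target, lead.pos,
                                   lead.cpr - SKIP_ELEM_PENALTY,
                                   "skip-elem", element, lead))
            # skipped-to-next-paragraph step
            leaves.append(Node(lead.state, lead.pos + 1,
                               lead.cpr - SKIP_PARA_PENALTY,
                               "skip-para", None, lead))
    steps, node = [], lead
    while node.parent is not None:
        if node.kind == "markup":
            steps.append((node.element, node.pos - 1))
        node = node.parent
    return list(reversed(steps)), lead.cpr


# Hypothetical BESM and matchers; a matcher returns a priority value or None.
besm = {
    "start": [("title", "afterTitle")],
    "afterTitle": [("author", "afterAuthor")],
    "afterAuthor": [("author", "afterAuthor"), ("para", "body")],
    "body": [("para", "body")],
}
matchers = {
    "title": lambda p: 10 if p.isupper() else None,
    "author": lambda p: 8 if p.startswith("by ") else None,
    "para": lambda p: 2 if len(p) > 20 else None,
}
paragraphs = ["STRUCTURE INFERENCE", "by A. Author", "***",
              "This paragraph is ordinary body text."]
steps, score = infer(paragraphs, besm, matchers, "start")
print(steps, score)  # [('title', 0), ('author', 1), ('para', 3)] 15
```

In this run the stray `***` paragraph matches nothing, so the search pays one skip-paragraph penalty and then resumes matching from the same BESM state.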



Tree Node 



-parent, first child, left sibling, right sibling : Tree Node 
TCT Node



cumulative plausibility rating : int
avg transition gain : float
+BESM state
+next document position : int



Markup Conversion Step 



+baseline element : Baseline Element Definition 
+element range : Text Range 
Skipped Element Conversion Step



+skipped element : Baseline Element Definition

Skipped-to-Next-Paragraph
Conversion Step



Figure 11. Tentative conversion tree (TCT) nodes, holding information about tentative
conversion steps within the core structure inference algorithm

(skip trans)  (skip para)
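
The class boxes of Figure 11 might map onto code roughly as follows. This is only a sketch: the inheritance of the three step types from the TCT node is assumed from the layout of the diagram, and the field types for the BESM state, the baseline element definition, and the text range are placeholders.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(eq=False)
class TreeNode:
    parent: Optional["TreeNode"] = field(default=None, repr=False)
    first_child: Optional["TreeNode"] = field(default=None, repr=False)
    left_sibling: Optional["TreeNode"] = field(default=None, repr=False)
    right_sibling: Optional["TreeNode"] = field(default=None, repr=False)

    def add_child(self, child):
        """Append `child` as the rightmost child and return it."""
        child.parent = self
        if self.first_child is None:
            self.first_child = child
        else:
            last = self.first_child
            while last.right_sibling is not None:
                last = last.right_sibling
            last.right_sibling = child
            child.left_sibling = last
        return child


@dataclass(eq=False)
class TCTNode(TreeNode):
    cumulative_plausibility_rating: int = 0
    avg_transition_gain: float = 0.0
    besm_state: str = ""                 # placeholder type
    next_document_position: int = 0


@dataclass(eq=False)
class MarkupConversionStep(TCTNode):
    baseline_element: str = ""           # placeholder for Baseline Element Definition
    element_range: tuple = (0, 0)        # placeholder for Text Range


@dataclass(eq=False)
class SkippedElementConversionStep(TCTNode):
    skipped_element: str = ""            # placeholder for Baseline Element Definition


@dataclass(eq=False)
class SkippedToNextParagraphConversionStep(TCTNode):
    pass


root = TCTNode(besm_state="start")
matched = root.add_child(MarkupConversionStep(
    cumulative_plausibility_rating=10, besm_state="afterTitle",
    next_document_position=19, baseline_element="title", element_range=(0, 19)))
skipped = root.add_child(SkippedToNextParagraphConversionStep(
    cumulative_plausibility_rating=-5, besm_state="start",
    next_document_position=20))
```

Excluding the link fields from `repr` and disabling generated equality avoids infinite recursion through the parent and sibling references.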



Figure 12: Tentative conversion tree sample 
elemPath := full element path
(starting from root element) of
baseline element to be committed;
prevElemPath := full element path
of last committed baseline
element;



iStep := 1 
Element range comes
from "markup conv. step"
TCT node (182).

Fig. 3: 54, 62



Considers the content model of elemPath[iStep-1], 
elemPath[iStep]'s multiplicity specifier, and whether 
each of elemPath[iStep..length(elemPath)] can start 
a new valid content model group (within its parent). 



Figure 13. Committing a TCT node: creating a baseline element, inferring and 
creating higher-level structure, and creating sub-baseline markup 
S. mitis is characterized by the presence, in each repeating unit, of two residues of
phosphocholine and both galactosamine residues in the N-acetylated form.
Immunochemical analysis showed that C-polysaccharide constitutes the
Lancefield group O antigen. Studies using mAbs directed against the backbone
and against the phosphocholine moiety of the C-polysaccharide revealed several